Abstract: This paper investigates the execution of tree-shaped task graphs using multiple 
processors. Each edge of such a tree represents a large I/O file. A task can only be executed if 
all input and output files fit into memory, and a file can only be removed from memory after it 
has been consumed. Such trees arise, for instance, in the multifrontal method of sparse matrix 
factorization. The maximum amount of memory needed depends on the execution order of the 
tasks. With one processor the objective of the tree traversal is to minimize the required memory. 
This problem was well studied and optimal polynomial algorithms were proposed. 
Here, we extend the problem by considering multiple processors, which is of obvious interest in 
the application area of matrix factorization. With the multiple processors comes the additional 
objective to minimize the time needed to traverse the tree, i.e., to minimize the makespan. Not 
surprisingly, this problem proves to be much harder than the sequential one. We study the compu- 
tational complexity of this problem and provide an inapproximability result even for unit weight 
trees. Several heuristics are proposed, each with a different optimization focus, and they are 
analyzed in an extensive experimental evaluation using realistic trees. 

Key-words: Scheduling, Memory-aware, Trees, Bi-objective optimization 
Scheduling task trees to minimize memory and execution time 

Summary: In this report, we study the processing of task trees by several processors. Each 
edge of such a tree represents a large input/output file. A task can be processed only if all of 
its input and output files can reside in memory, and a file can be removed from memory only 
once it has been processed. Such trees arise, for example, in the factorization of sparse matrices 
by multifrontal methods. The amount of memory required depends on the order in which the 
tasks are processed. With a single processor, the objective is naturally to minimize the amount 
of memory required. This problem has already been studied and polynomial algorithms have 
been proposed. 

We extend this problem by considering several processors, which is of obvious interest for the 
factorization of large matrices. With several processors also comes the problem of minimizing 
the time needed to process the tree. We show that, as expected, this problem is much harder 
than in the sequential case. We study the complexity of this problem and provide 
inapproximability results, even in the case of unit weights. We propose several heuristics to 
obtain a schedule, each focusing on a different objective. We analyze their performance through 
a large simulation campaign using realistic trees. 

Keywords: Scheduling, Memory constraint, Trees, Bi-criteria optimization 
1 Introduction 

Parallel workloads are often modeled as task graphs, where nodes represent tasks and edges rep- 
resent the dependencies between tasks. There is an abundant literature on task graph scheduling 
when the objective is to minimize the total completion time, or makespan. However, as the size of 
the data to be processed is increasing, the memory footprint of the application must be optimized 
as it can have a dramatic impact on the algorithm execution time. This is best exemplified with 
an application which, depending on the way it is scheduled, will either fit in the memory, or will 
require the use of swap mechanisms or out-of-core techniques. There are very few existing studies 
on the minimization of the memory footprint when scheduling task graphs, and even fewer of them 
targeting parallel systems. 

We consider the following memory-aware parallel scheduling problem for rooted trees. The 
nodes of the tree correspond to tasks, and the edges correspond to the dependencies among the 
tasks. The dependencies are in the form of input and output files: each node takes as input several 
large files, one for each of its children, and it produces a single large file, and the different files 
may have different sizes. Furthermore, the execution of any node requires its execution file to be 
present; the execution file can be seen as the program of the task. We are to execute such a set 
of tasks on a parallel system made of p identical processing resources sharing the same memory. 
The execution scheme corresponds to a schedule of the tree where processing a node of the tree 
translates into reading the associated input files and producing the output file. How can the tree 
be scheduled so as to optimize the memory usage? 

Modern computing platforms exhibit a complex memory hierarchy ranging from caches to 
RAM and disks and even sometimes tape storage, with the classical property that the smaller 
the memory, the quicker. Thus, to avoid large running times, one usually wants to avoid the use 
of memory devices whose I/O bandwidth is below a given threshold: even if out-of-core execution 
(when large data are unloaded to disks) is possible, this requires special care when programming 
the application and one usually wants to stay in the main memory (RAM). This is why in this 
paper, we are interested in the question of minimizing the amount of main memory needed to 
completely process an application. 

Throughout the paper, we consider in-trees where a task can be executed only if all its children 
have already been executed. (This is absolutely equivalent to considering out-trees as a solution 
for an in-tree can be transformed into a solution for the corresponding out-tree by just reversing 
the arrow of time, as outlined in [2.) A task can be processed only if all its files (input, output, 
and execution) fit in currently available memory. At a given time, many files may be stored in 
the memory, and at most p tasks may be processed by the p processors. This is obviously possible 
only if all tasks and execution files fit in memory. When a task finishes, the memory needed for its 
execution file and its input files is released. Clearly, the schedule which determines the processing 
times of each task plays a key role in determining which amount of main memory is needed for a 
successful execution of the whole tree. 

The first motivation for this work comes from numerical linear algebra. Tree workflows (as- 
sembly or elimination trees) arise during the factorization of sparse matrices, and the huge size of 
the files involved makes it absolutely necessary to reduce the memory requirement of the 
factorization. The sequential version of this problem (i.e., with p = 1 processor) has already been 
studied. Liu [13] discusses how to find a memory-minimizing traversal when the traversal is 
required to correspond to a postorder traversal of the tree. In the follow-up study [14], an exact 
algorithm is shown to solve the problem, without the postorder constraint on the traversal. 
Recently, some of us [8] proposed another algorithm to find a memory-optimal traversal, which 
proved to be faster on existing elimination trees, although being of the same worst-case 
complexity ($O(n^2)$). 

The parallel version of this problem is a natural continuation of these studies: when processing 
large elimination trees, it is very meaningful to take advantage of parallel processing resources. 
However, to the best of our knowledge, there exist no theoretical studies for this problem. The 
key contributions of this work are: 

• The proof that the parallel variant of the pebble game problem is NP-complete. This shows 
that the introduction of memory constraints, in the simplest cases, suffices to make the 
problem NP-hard. 

• The proof that no algorithm can simultaneously deliver a constant-ratio approximation for 
the memory minimization and for the makespan minimization. 

• A set of heuristics having different optimizing focus. 

• An exhaustive set of simulations using realistic tree shaped task graphs to assess the relative 
and absolute performance of these heuristics. 

The rest of this paper is organized as follows. Section 2 reviews related studies. The notation 
and formalization of the problem are introduced in Section 3. Complexity results are presented in 
Section 4, while Section 5 proposes different heuristics to solve the problem, which are evaluated 
in Section 6. 

2 Background and Related Work 

2.1 Sparse matrix factorization 

As mentioned above, determining a memory-efficient tree traversal is very important in sparse 
numerical linear algebra. The elimination tree is a graph theoretical model that represents the 
storage requirements, and computational dependencies and requirements, in the Cholesky and 
LU factorization of sparse matrices. In a previous study, we have described how such trees are 
built, and how the multifrontal method organizes the computations along the tree j^. This is the 
context of the founding studies of Liu [13, 14] on memory minimization for postorder or general 
tree traversals presented in the previous section. Memory minimization is still a concern in modern 
multifrontal solvers when dealing with large matrices. Among others, efforts have been made to 
design dynamic schedulers that take into account dynamic pivoting (which impacts the weights 
of edges and nodes) when scheduling elimination trees with strong memory constraints [5], or to 
consider both task and tree parallelism with memory constraints [1]. While these studies try to 
optimize memory management in existing parallel solvers, we aim at designing a simple model to 
study the fundamental underlying scheduling problem. 

2.2 Scientific workflows 

The problem of scheduling a task graph under memory constraints also appears in the processing 
of scientific workflows whose tasks require large I/O files. Such workflows arise in many scientific 
fields, such as image processing, genomics or geophysical simulations. The problem of task graphs 
handling large data has been identified in [15], which proposes some simple heuristic solutions. 
Surprisingly, in the context of quantum chemistry computations, Lam et al. [11] have recently 
rediscovered the algorithm published in 1987 in [14]. 

2.3 Pebble game and its variants 

On the more theoretical side, this work builds upon the many papers that have addressed the 
pebble game and its variants. Scheduling a graph on one processor with the minimal amount of 
memory amounts to revisiting the I/O pebble game with pebbles of arbitrary sizes that must be 
loaded into main memory before firing (executing) the task. The pioneering work of Sethi and 
Ullman [17] deals with a variant of the pebble game that translates into the simplest instance 
of our problem when all input/output files have weight 1 and all execution files have weight 0. 
The concern in [17] was to minimize the number of registers that are needed to compute an 
arithmetic expression. The problem of determining whether a general DAG can be executed with 
a given number of pebbles has been shown NP-hard by Sethi [16] if no vertex is pebbled more than 
once (the general problem allowing recomputation, that is, re-pebbling a vertex which has been 
pebbled before, has been proven PSPACE-complete [3]). However, this problem has a polynomial 
complexity for tree-shaped graphs [17]. 
To the best of our knowledge, there have been no attempts to extend these results to parallel 
machines, with the objective of minimizing both memory and total execution time. We present 
such an extension in Section 4. 

3 Model and objectives 
3.1 Application model 

We consider in this paper a tree-shaped task-graph T composed of n nodes, or tasks, numbered 
from 1 to n. Nodes in the tree have an output file, an execution file (or program), and several 
input files (one per child). More precisely: 

• Each node $i$ in the tree has an execution file of size $n_i$ and its processing on a processor 
takes time $w_i$. 


• Each node $i$ has an output file of size $f_i$. If $i$ is not the root, its output file is used as 
input by its parent $parent(i)$; if $i$ is the root, its output file can be of size zero, or contain 
outputs to the outside world. 

• Each non-leaf node $i$ in the tree has one input file per child. We denote by $Children(i)$ the 
set of the children of $i$. For each child $j \in Children(i)$, task $j$ produces a file of size $f_j$ 
for $i$. If $i$ is a leaf node, then $Children(i) = \emptyset$ and $i$ has no input file: we consider 
that the initial data of the task either reside in its execution file or are read from disk (or 
received from the outside world) during the execution of the task. 

During the processing of a task i, the memory must contain its input files, the execution file, 
and the output file. The memory needed for this processing is thus: 

$$\left( \sum_{j \in Children(i)} f_j \right) + n_i + f_i.$$ 

After i has been processed, its input files and program are discarded, while its output file is kept 
in memory until the processing of its parent. 
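
The memory rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `mem_req` and `sequential_peak`, the dictionary-based tree encoding, and the 5-task example tree are all hypothetical and do not come from the paper. The sketch assumes the given order lists every child before its parent.

```python
def mem_req(i, f, n, children):
    """Memory needed while task i runs: input files + execution file + output file."""
    return sum(f[j] for j in children.get(i, [])) + n[i] + f[i]


def sequential_peak(order, f, n, children):
    """Peak memory of a one-processor traversal (children must precede parents)."""
    stored = 0  # total size of the output files waiting for their parent
    peak = 0
    for i in order:
        # The input files of i are already counted in `stored`.
        peak = max(peak, stored + n[i] + f[i])
        # After i completes, its inputs are discarded and its output is kept.
        stored += f[i] - sum(f[j] for j in children.get(i, []))
    return peak


# Hypothetical tree: 0 is the root, 1 and 2 are its children, 3 and 4 the children of 1.
children = {0: [1, 2], 1: [3, 4]}
f = {0: 0, 1: 1, 2: 4, 3: 5, 4: 5}  # output file sizes
n = {i: 0 for i in f}               # execution file sizes

peak_a = sequential_peak([3, 4, 1, 2, 0], f, n, children)  # finish subtree of 1 first
peak_b = sequential_peak([2, 3, 4, 1, 0], f, n, children)  # start with task 2
```

On this small example the two orders lead to different peaks, which illustrates why the traversal order matters even on a single processor.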

3.2 Platform model and objectives 

In this paper, our goal is to design the simplest platform model that allows us to study memory 
minimization on a parallel platform. We thus consider p identical processors which share a single 
memory. We do not consider here a hard constraint on the memory, but we rather include memory 
in the objectives. We thus consider multi-criteria optimization with the following two objectives: 

• Makespan. Our first objective is the classical makespan, or total execution time, which 
corresponds to the time span between the beginning of the execution of the first leaf task 
and the end of the processing of the root task. 

• Memory. Our second objective is the amount of memory needed for the computation. At 
each time step, some files are stored in the memory and some task computations occur, 
which induces a memory usage. The peak memory is the maximum usage of the memory 
over the whole schedule, which we aim at minimizing. 
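
As a rough illustration of the two objectives, the following Python sketch evaluates a given schedule, described by one start time per task, on p processors. The function name `evaluate_schedule`, the data layout, and the example tree are assumptions made for this sketch. It relies on the model stated above: a running task holds its execution file and its output file, and a completed task keeps its output file in memory until its parent completes. Since memory usage only increases when a task starts, the sketch samples usage at start times.

```python
def evaluate_schedule(start, w, f, n, parent, p):
    """Return (makespan, peak memory) of a schedule given by task start times."""
    end = {i: start[i] + w[i] for i in start}
    for i, par in parent.items():
        if par is not None and start[par] < end[i]:
            raise ValueError(f"task {par} starts before its child {i} finishes")
    peak = 0
    for t in sorted(set(start.values())):
        running = [i for i in start if start[i] <= t < end[i]]
        if len(running) > p:
            raise ValueError(f"more than {p} tasks run at time {t}")
        # Running tasks hold their execution file and their output file.
        usage = sum(n[i] + f[i] for i in running)
        # Completed tasks keep their output until their parent has completed.
        usage += sum(f[i] for i in start
                     if end[i] <= t
                     and parent[i] is not None
                     and end[parent[i]] > t)
        peak = max(peak, usage)
    makespan = max(end.values()) - min(start.values())
    return makespan, peak


# Hypothetical tree: 0 is the root, 1 and 2 its children, 3 and 4 the children of 1.
parent = {0: None, 1: 0, 2: 0, 3: 1, 4: 1}
w = {i: 1 for i in parent}          # unit processing times
f = {0: 0, 1: 1, 2: 4, 3: 5, 4: 5}  # output file sizes
n = {i: 0 for i in parent}          # execution file sizes

two_procs = evaluate_schedule({3: 0, 4: 0, 1: 1, 2: 1, 0: 2}, w, f, n, parent, p=2)
one_proc = evaluate_schedule({3: 0, 4: 1, 1: 2, 2: 3, 0: 4}, w, f, n, parent, p=1)
```

In this toy instance the two-processor schedule is shorter but needs more memory than the sequential one, which is the trade-off studied in the rest of the paper.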

4 Complexity results in the Pebble Game model 

Since there are two objectives, the decision version of our problem can be stated as follows. 



Figure 1: Tree used for the NP-completeness proof 



Definition 1 (BiObjectiveParallelTreeScheduling). Given a tree-shaped task graph T provided 
with memory weights and task durations, p processors, and two bounds $B_{C_{max}}$ and $B_{mem}$, is 
there a schedule of the task graph on the processors whose makespan is not larger than 
$B_{C_{max}}$ and whose peak memory is not larger than $B_{mem}$? 

This problem is obviously NP-complete. Indeed, when there are no memory constraints 
($B_{mem} = \infty$) and when the task tree does not contain any inner node, that is, when all tasks 
are either leaves or the root, then our problem is equivalent to scheduling independent tasks on 
a parallel platform, which is an NP-complete problem as soon as tasks have different execution 
times 1^. On the contrary, minimizing the makespan for a tree of same-size tasks can be solved 
in polynomial time when there are no memory constraints [7]. In this section, we consider the 
simplest variant of the problem. We assume that all input files have the same size 
($\forall i, f_i = 1$) and no extra memory is needed for computation ($\forall i, n_i = 0$). 
Furthermore, we assume that the processing of each node takes a unit time: $\forall i, w_i = 1$. 
We call this variant of the problem the Pebble Game model since it perfectly corresponds to the 
pebble game problems introduced above: the weight $f_i = 1$ corresponds to the pebble put on one 
node once it has been processed and its result is not yet discarded. Processing a node requires 
putting an extra pebble on this node and is done in unit time. 

In this section, we first show that even in this simple variant, the introduction of memory 
constraints (a limited number of pebbles) makes the problem NP-hard (Section 4.1). Then, we 
show that when trying to minimize both memory and makespan, it is not possible to get a solution 
with a constant approximation ratio for both objectives (Section 4.2). 

4.1 NP-completeness 

Theorem 1. The BiObjectiveParallelTreeScheduling problem is NP-complete in the Pebble Game 
model (i.e., with $\forall i, f_i = w_i = 1, n_i = 0$). 

Proof. First, it is straightforward to check that the problem is in NP: given a schedule, it is easy 
to compute its peak memory and makespan. 

To prove that the problem is NP-complete, we perform a reduction from 3-Partition, which is 
known to be NP-complete in the strong sense [2]. We consider the following instance $I_1$ of the 
3-Partition problem: let $a_1, \ldots, a_{3m}$ be $3m$ integers and $B$ an integer such that 
$\sum_i a_i = mB$. We consider the variant of the problem, also NP-complete, where 
$\forall i, B/4 < a_i < B/2$. To solve $I_1$, we need to solve the following question: does there 
exist a partition of the $a_i$'s in $m$ subsets $S_1, \ldots, S_m$, each containing exactly 3 
elements, such that, for each $S_k$, $\sum_{i \in S_k} a_i = B$? We build the following 
instance $I_2$ of our problem, illustrated on Figure 1. The tree contains a root $r$ with $3m$ 
children, the $N_i$'s, each one corresponding to a value $a_i$. Each node $N_i$ has 
$3m \times a_i$ children, which are leaf nodes. The question is to find a schedule of this tree on 
$p = 3mB$ processors, whose peak memory is not larger than $B_{mem} = 3m \times B + 3m$ and whose 
makespan is not larger than $B_{C_{max}} = 2m + 1$. 

Assume first that there exists a solution to $I_1$, i.e., that there are $m$ subsets $S_k$ of 3 
elements with $\sum_{i \in S_k} a_i = B$. In this case, we build the following schedule: 

• At step 1, we process all the leaf children of the nodes $N_{i_1}$, $N_{j_1}$, and $N_{k_1}$ with 
$S_1 = \{a_{i_1}, a_{j_1}, a_{k_1}\}$. There are $3mB = p$ such nodes, and the amount of memory 
needed is also $3mB$. 

• At step 2, we process the nodes $N_{i_1}$, $N_{j_1}$, $N_{k_1}$. The memory needed is $3mB + 3$. 

• At step $2n + 1$, with $1 \le n \le m - 1$, we process the $3mB = p$ leaf children of the nodes 
$N_{i_{n+1}}$, $N_{j_{n+1}}$, $N_{k_{n+1}}$ with $S_{n+1} = \{a_{i_{n+1}}, a_{j_{n+1}}, a_{k_{n+1}}\}$. 
The amount of memory needed is $3mB + 3n$ (counting the memory for the output files of the 
$N_t$ nodes previously processed). 

• At step $2n + 2$, with $1 \le n \le m - 1$, we process the nodes $N_{i_{n+1}}$, $N_{j_{n+1}}$, 
$N_{k_{n+1}}$. The memory needed for this step is $3mB + 3(n + 1)$. 

• At step $2m + 1$, we process the root node and the memory needed is $3m + 1$. 

Thus, the peak memory of this schedule is $B_{mem}$ and its makespan is $B_{C_{max}}$. 
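
The step-by-step accounting of this schedule can be replayed numerically. The Python sketch below is only a sanity check under assumptions: the function name `replay_reduction_schedule` and the small 3-Partition instance are hypothetical, and the code counts pebbles (unit-size output files) per step rather than building the tree explicitly.

```python
def replay_reduction_schedule(a, triples):
    """Replay the schedule built from a 3-Partition solution; return (makespan, peak)."""
    m = len(triples)
    B = sum(a) // m
    assert sum(a) == m * B
    assert all(B / 4 < x < B / 2 for x in a)
    p = 3 * m * B   # number of processors in the reduction
    stored = 0      # pebbles kept in memory between two steps
    peak = 0
    steps = 0
    for triple in triples:
        assert sum(a[i] for i in triple) == B
        leaves = sum(3 * m * a[i] for i in triple)
        assert leaves <= p
        # Odd step: process all leaf children of the three selected N nodes.
        peak = max(peak, stored + leaves)
        stored += leaves
        steps += 1
        # Even step: process the three N nodes (inputs still in memory, 3 new outputs).
        peak = max(peak, stored + 3)
        stored += 3 - leaves
        steps += 1
    # Last step: process the root, with the 3m outputs of the N nodes as inputs.
    peak = max(peak, stored + 1)
    steps += 1
    return steps, peak


# Hypothetical instance: m = 2, B = 15, and every a_i lies strictly between B/4 and B/2.
a = [4, 5, 6, 4, 4, 7]
makespan, peak = replay_reduction_schedule(a, [(0, 1, 2), (3, 4, 5)])
```

For this instance the replay yields a makespan of 2m + 1 and a peak of 3mB + 3m, matching the two bounds of the reduction.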

On the contrary, assume that there exists a solution to problem $I_2$, that is, that there exists 
a schedule of makespan at most $B_{C_{max}} = 2m + 1$ and of peak memory at most $B_{mem}$. Without 
loss of generality, we assume that the makespan is exactly $2m + 1$. We start by proving that at 
any step of the algorithm there are at most three of the $N_i$ nodes that are processed. By 
contradiction, assume that four (or more) such nodes $N_{i_1}, N_{i_2}, N_{i_3}, N_{i_4}$ are 
processed during a certain step. We recall that $a_i > B/4$ so that 
$a_{i_1} + a_{i_2} + a_{i_3} + a_{i_4} > B$ and thus $a_{i_1} + a_{i_2} + a_{i_3} + a_{i_4} \ge B + 1$. 
The memory needed at this step is thus at least $(B + 1)3m$ for the children of the nodes 
$N_{i_1}, N_{i_2}, N_{i_3}$, and $N_{i_4}$ and 4 for the nodes themselves, hence a total of at least 
$(B + 1)3m + 4$, which is more than the prescribed bound $B_{mem}$. Thus, at most three of the 
$N_i$ nodes are processed at any step. In the considered schedule, the root node is processed at 
step $2m + 1$. Then, at step $2m$, some of the $N_i$ nodes are processed, and at most three of 
them from what precedes. The $a_i$'s corresponding to those nodes make the first subset $S_1$. 
Then all the children of the nodes $N_j$ such that $a_j \in S_1$ must have been processed at the 
latest at step $2m - 1$, and they occupy a memory footprint of $3m \sum_{a_j \in S_1} a_j$ at 
steps $2m - 1$ and $2m$. Let us assume that a node $N_k$ is processed at step $2m - 1$. For the 
memory bound $B_{mem}$ to be satisfied we must have $a_k + \sum_{a_j \in S_1} a_j \le B$. 
(Otherwise, we would need a memory of at least $3m(B + 1)$ for the involved nodes plus 1 for the 
node $N_k$.) Therefore, node $N_k$ could have been processed at step $2m$. We then modify the 
schedule so as to schedule $N_k$ at step $2m$ and thus we add $a_k$ to $S_1$. We can therefore 
assume, without loss of generality, that no $N_i$ node is processed at step $2m - 1$. Then, at 
step $2m - 1$ only children of the $N_j$ nodes with $a_j \in S_1$ are processed, and all of them 
are. So, none of them have any memory footprint before step $2m - 1$. We then generalize this 
analysis: at step $2i$, for $1 \le i \le m - 1$, only some $N_j$ nodes are processed and they 
define a subset $S_{m-i+1}$; at step $2i - 1$, for $1 \le i \le m - 1$, the nodes processed are 
exactly the children of the nodes $N_j$ such that $a_j \in S_{m-i+1}$. 

Because of the memory constraint, each of the $m$ subsets of $a_i$'s built above sums to at most 
$B$. Since they contain all $a_j$'s, their sum is $mB$. Thus, each subset $S_k$ sums to $B$ and we 
have built a solution for $I_1$. □ 

4.2 Joint minimization of both objectives 

As our problem is NP-complete, it is natural to wonder whether there exist approximation 
algorithms. Here, we prove that there exists no schedule that approximates both the minimum 
makespan and the minimum memory within constant factors.* 

Theorem 2. There is no algorithm that is both an $\alpha$-approximation for makespan minimization 
and a $\beta$-approximation for memory peak minimization when scheduling in-tree task graphs. 

Proof. To establish this result, we proceed by contradiction. We therefore assume that there is an 
integer $\alpha$, an integer $\beta$, and an algorithm $A$ that processes any input tree $T$ in a 
time not greater than $\alpha$ times the optimal execution time while using a peak memory that is 
not greater than $\beta$ times the optimal peak memory. 

The tree. Figure 2 presents the tree used to derive a contradiction. This tree is made of $n$ 
identical subtrees whose roots are the children of the tree root. The values of $n$ and $\delta$ 
will be fixed later on. 

* This is equivalent to saying that there is no Zenith or simultaneous approximation. 
Figure 2: Tree used for establishing Theorem 2 

Optimal execution time. The optimal execution time is equal to the length of the critical 
path, as we have made no hypothesis on the number of available processors. The critical path has 
a length of $\delta + 2$, which is the length of the path from the root to any $b^i_{\delta+1}$ 
node, or to either of the two leaf children (the $a$ nodes) of any $d^i_{\delta-1}$ node, 
with $1 \le i \le n$. 

Optimal peak memory. Let us consider any sequential execution that is optimal with regard 
to the peak memory usage. Under this execution, let $d^i_1$ be the last processed node among the 
$d^j_1$ nodes, $1 \le j \le n$. We consider the step at which node $d^i_1$ is processed. As, by 
hypothesis, all the $d^j_1$ nodes, $1 \le j \le n$ and $j \ne i$, have already been processed, 
there are in memory at that step at least $n - 1$ results. The processing of $d^i_1$ requires 
$\delta + 1$ memory units as this node has $\delta$ children. Hence, a total memory usage of at 
least $(n - 1) + (\delta + 1) = \delta + n$ for the processing of $d^i_1$. This is obviously a 
lower bound on the optimal peak memory usage. We now show that this bound can be reached. 

We consider the following schedule: 

• Completely process first the subtree rooted at $cp^1_1$, then the subtree rooted at $cp^2_1$, 
and so on. 

• The subtree rooted at $cp^i_1$ is processed as follows: for $j$ going from 1 to $\delta - 1$, 
process the $\delta - j + 1$ children of node $d^i_j$, then node $d^i_j$; then process nodes 
$b^i_{\delta+1}$, $b^i_\delta$, and nodes $cp^i_{\delta-1}$ to $cp^i_1$. 

When the subtree rooted at $cp^i_1$ is processed there are in memory exactly $i - 1$ results 
coming from the processing of the first $i - 1$ subtrees. These are exactly the results of the 
processing of the nodes $cp^1_1, \ldots, cp^{i-1}_1$. 

The processing of node $d^i_j$ requires a memory of $\delta - j + 2$, for 
$1 \le j \le \delta - 1$: this node has $\delta - j + 1$ inputs and one output. When node $d^i_j$ 
is processed the memory contains $j - 1$ results due to the processing of the subtree rooted at 
$cp^i_1$: the results of the processing of nodes $d^i_1$ to $d^i_{j-1}$. Hence, the total memory 
usage when node $d^i_j$ is processed is $(i - 1) + (\delta - j + 2) + (j - 1) = i + \delta$. 

Accordingly, when $b^i_{\delta+1}$ is processed the memory usage is 
$(i - 1) + (\delta - 1) + 1 = i + \delta - 1$, and when $b^i_\delta$ is processed it is 
$(i - 1) + (\delta - 1) + 2 = i + \delta$. Later on, when node $cp^i_j$ is processed, for 
$1 \le j \le \delta - 1$, the memory usage is $(i - 1) + j + 2 = i + j + 1 \le i + \delta$. 
Indeed, at that time, the only data in memory relative to the processing of the subtree rooted at 
$cp^i_1$ are 1) the results of the nodes $d^i_1$ through $d^i_j$, 2) the result of $cp^i_{j+1}$ 
if $j < \delta - 1$ or of $b^i_\delta$ otherwise, and 3) the result of the processing of node 
$cp^i_j$. 

Under this schedule, the peak memory usage during the processing of the subtree rooted at 
$cp^i_1$ is $i + \delta$. The overall peak memory usage of the studied schedule is then 
$n + \delta$, which is thus the optimal peak memory usage. 
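
The peak of this sequential schedule can be checked on small instances. The Python sketch below rebuilds the tree from the textual description only (the figure is not available here), so the exact shape is an assumption: each $cp^i_j$ is taken to have children $d^i_j$ and $cp^i_{j+1}$ (or $b^i_\delta$ for $j = \delta - 1$), $b^i_{\delta+1}$ is taken to be the only child of $b^i_\delta$, and $d^i_j$ has $\delta - j + 1$ leaf children. The names `build_tree_and_order` and `pebble_peak` are hypothetical.

```python
def build_tree_and_order(n, delta):
    """Build the assumed tree and the sequential order described in the text."""
    children = {}
    order = []
    for i in range(1, n + 1):
        # For j = 1 .. delta-1: the leaf children of d_j, then d_j itself.
        for j in range(1, delta):
            d = ("d", i, j)
            leaves = [("a", i, j, k) for k in range(delta - j + 1)]
            children[d] = leaves
            order += leaves + [d]
        # Then b_{delta+1} and b_delta.
        b_hi, b_lo = ("b", i, delta + 1), ("b", i, delta)
        children[b_lo] = [b_hi]
        order += [b_hi, b_lo]
        # Then cp_{delta-1} down to cp_1.
        below = b_lo
        for j in range(delta - 1, 0, -1):
            cp = ("cp", i, j)
            children[cp] = [below, ("d", i, j)]
            order.append(cp)
            below = cp
    children["root"] = [("cp", i, 1) for i in range(1, n + 1)]
    order.append("root")
    return children, order


def pebble_peak(children, order):
    """Peak number of pebbles of a sequential traversal in the Pebble Game model."""
    stored = set()
    peak = 0
    for v in order:
        inputs = children.get(v, [])
        assert all(c in stored for c in inputs), "a child was not processed yet"
        peak = max(peak, len(stored) + 1)  # stored results plus the new output
        stored.difference_update(inputs)
        stored.add(v)
    return peak


children, order = build_tree_and_order(n=3, delta=4)
peak = pebble_peak(children, order)
```

Under these assumptions, the simulated peak equals $n + \delta$ on the small instances tried, in line with the analysis above.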

Lower bound on the peak memory usage of $A$. The peak memory usage is not smaller 
than the average memory usage. We derive the desired contradiction by using the average memory 
usage of algorithm $A$ as a lower bound to its peak memory usage. 

By hypothesis, algorithm $A$ is an $\alpha$-approximation with regard to makespan minimization. 
Therefore the processing of the tree by algorithm $A$ should complete at the latest at time 
$\alpha(\delta + 2)$. To ensure that, the $n$ nodes $cp^i_1$, $1 \le i \le n$, must all be 
executed at the latest at time $\alpha(\delta + 2) - 1$. Therefore, all the descendants of these 
nodes must be executed between time 0 and time $\alpha(\delta + 2) - 2$. We now evaluate the 
number of these descendants and their memory footprints. 

The descendants of node $cp^i_1$ include the two nodes $b^i_\delta$ and $b^i_{\delta+1}$, the 
$\delta - 2$ nodes $cp^i_2$ to $cp^i_{\delta-1}$, the $\delta - 1$ nodes $d^i_1$ through 
$d^i_{\delta-1}$, and, finally, the descendants of the $d^i_j$ nodes, for 
$1 \le j \le \delta - 1$. As node $d^i_j$ has $\delta - j + 1$ descendants, the number of 
descendants of node $cp^i_1$ is: 

$$2 + (\delta - 2) + (\delta - 1) + \sum_{j=1}^{\delta-1} (\delta - j + 1) = \frac{\delta^2 + 5\delta - 4}{2}.$$ 

All together, the nodes $cp^i_1$, for $1 \le i \le n$, have $n \, \frac{\delta^2 + 5\delta - 4}{2}$ 
descendants. 
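
For readers who want the intermediate steps, the count of descendants of one node $cp^i_1$ can be worked out as follows. This is a short derivation added for convenience; the substitution $k = \delta - j + 1$ is ours.

```latex
\begin{align*}
2 + (\delta-2) + (\delta-1) + \sum_{j=1}^{\delta-1}(\delta-j+1)
  &= (2\delta - 1) + \sum_{k=2}^{\delta} k && (k = \delta - j + 1)\\
  &= (2\delta - 1) + \frac{\delta(\delta+1)}{2} - 1\\
  &= \frac{\delta^2 + 5\delta - 4}{2}.
\end{align*}
```

For instance, with $\delta = 2$ the count is $2 + 0 + 1 + 2 = 5$, which agrees with $(4 + 10 - 4)/2 = 5$.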

We consider the memory footprint of each of these nodes between time step 0 and time step 
$\alpha(\delta + 2) - 2$. The result of the processing of each of these nodes must be in memory 
for at least two steps in this interval, the step at which the node is processed and the step at 
which its parent node is processed, except for the nodes $d^j_1$, $1 \le j \le n$, and $cp^k_2$, 
for $1 \le k \le n$, whose parents need not have been processed in that interval and thus need 
only to be present in memory during one time step. The overall memory footprint between time 0 
and $\alpha(\delta + 2) - 2$ is then at least: 

$$n \left( \left( \frac{\delta^2 + 5\delta - 4}{2} - 2 \right) \times 2 + 2 \times 1 \right) = n \left( \delta^2 + 5\delta - 6 \right).$$ 



The average memory usage during that period is thus at least: 

$$\frac{n \left( \delta^2 + 5\delta - 6 \right)}{\alpha(\delta + 2) - 2}.$$ 

This is obviously a lower bound on the overall peak memory usage. This bound enables us to 
derive a lower bound $lb$ on the approximation ratio $\rho$ of algorithm $A$ with regard to 
memory usage: 

$$\rho \ge lb = \frac{\dfrac{n(\delta^2 + 5\delta - 6)}{\alpha(\delta + 2) - 2}}{n + \delta} = \frac{n \left( \delta^2 + 5\delta - 6 \right)}{\left( \alpha(\delta + 2) - 2 \right)(n + \delta)}.$$ 
We then let $\delta = n^2$. Therefore, 

$$lb = \frac{n \left( n^4 + 5n^2 - 6 \right)}{\left( \alpha(n^2 + 2) - 2 \right)(n + n^2)}.$$ 



Then, $lb$ tends to $+\infty$ when $n$ tends to infinity. There is thus a value $n_0$ such that, 
for any value $n \ge n_0$, the right-hand side is greater than $2\beta$. We let $n = n_0$ and we 
obtain: 

$$\beta \ge \rho \ge lb = \frac{n_0 \left( n_0^4 + 5n_0^2 - 6 \right)}{\left( \alpha(n_0^2 + 2) - 2 \right)(n_0 + n_0^2)} > 2\beta,$$ 

which contradicts the definition of $\beta$. □ 
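
To get a feel for how fast the bound grows, the following Python sketch evaluates $lb$ exactly with $\delta = n^2$ and searches for the first $n$ at which it exceeds $2\beta$. The function names and the sample values of $\alpha$ and $\beta$ are illustrative assumptions; the search only finds the first crossing and does not by itself establish that the bound stays above $2\beta$ afterwards.

```python
from fractions import Fraction


def memory_ratio_lower_bound(n, alpha):
    """Exact value of lb for delta = n^2."""
    delta = n * n
    numerator = n * (delta**2 + 5 * delta - 6)
    denominator = (alpha * (delta + 2) - 2) * (n + delta)
    return Fraction(numerator, denominator)


def first_crossing(alpha, beta):
    """Smallest n >= 1 for which lb exceeds 2 * beta."""
    n = 1
    while memory_ratio_lower_bound(n, alpha) <= 2 * beta:
        n += 1
    return n


n0 = first_crossing(alpha=2, beta=3)
lb_at_n0 = memory_ratio_lower_bound(n0, alpha=2)
```

Since the numerator grows like $n^5$ and the denominator like $\alpha n^4$, the bound behaves roughly like $n/\alpha$ for large $n$.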



5 Heuristics 

Given the complexity of optimizing the makespan and memory at the same time, we have 
investigated heuristics and propose three algorithms: ParSubtrees, ParInnerFirst, and 
ParDeepestFirst. The intention is that the proposed algorithms cover a range of use cases, where 
the optimization focus wanders between the makespan and the required memory. ParSubtrees 
employs a memory-optimizing sequential algorithm for its subtrees, hence its focus is more on the 
memory side. In contrast, ParInnerFirst and ParDeepestFirst are list scheduling based 
algorithms, which should be stronger in the makespan objective. Nevertheless, ParInnerFirst tries 
to approximate a postorder in parallel, which is good for memory in the sequential case. 
ParDeepestFirst's focus is fully on the makespan. 

The minimal memory requirement $M$ is achieved by using the optimal sequential algorithm [3], 
i.e., using p = 1 processor. Employing more processors cannot reduce the amount of memory 
required, yet the sequential algorithm is of course only a $p$-approximation of the optimal 
parallel makespan $C^{*}_{max}$. 



5.1 Heuristic ParSubtrees 

The most natural idea to process a tree $T$ in parallel is arguably to split it into subtrees and 
to process these subtrees in parallel, each using a sequential memory-optimal algorithm [8, 14]. 
An underlying idea is to give each processor a whole subtree in order to enable a lot of 
parallelism while also limiting the increase of the peak memory usage that can be observed when 
several processors work on the same subtree. Algorithm 1 outlines such an algorithm, together 
with the routine for splitting $T$ into subtrees given in Algorithm 2. The makespan obtained 
using ParSubtrees is denoted by $C^{ParSubtrees}_{max}$. 



Algorithm 1: ParSubtrees (T, p) 

1 Split tree T into q subtrees (q ≤ p) and remaining set of nodes, using SplitSubtrees (T, p). 

2 Concurrently process the q subtrees, each using a memory-minimizing algorithm, e.g., [8]. 

3 Sequentially process the remaining set of nodes, using a memory-minimizing algorithm. 



In this approach, $q$ subtrees of $T$, $q \le p$, are processed in parallel. Each of these 
subtrees is a maximal subtree of $T$. In other words, each of these subtrees includes all the 
descendants (in $T$) of its root. The nodes not belonging to the $q$ subtrees are processed 
sequentially. These are the nodes where the $q$ subtrees merge, the nodes included in subtrees 
that were produced in excess (if more than $p$ subtrees were created), and the ancestors of these 
nodes. An alternative approach, as discussed below, is to process all subtrees in parallel, 
assigning more than one subtree to each processor, but Algorithm 1 allows us to find a 
makespan-optimal splitting into subtrees, established shortly in Lemma 1. 

As $w_i$ is the computation weight of node $i$, $W_i$ denotes the total computation weight (i.e., 
sum of weights) of all nodes in the subtree rooted in $i$, including $i$. SplitSubtrees uses a 
node priority queue $PQ$ in which the nodes are sorted by non-increasing $W_i$, and ties are 
broken according to non-increasing $w_i$. $head(PQ)$ returns the first node of $PQ$, while 
$popHead(PQ)$ also removes it. $PQ[i]$ denotes the $i$-th element in the queue. 

SplitSubtrees starts with the root and continues splitting the largest subtree (in terms of 
$W$) until this subtree is a leaf node ($W_{head(PQ)} = w_{head(PQ)}$). The execution time of 
Step 2 of ParSubtrees is that of the largest of the $q$ subtrees, hence $W_{head(PQ)}$ of the 
splitting. Splitting subtrees that are smaller than the largest leaf 
($W_j < \max_{i \in T} w_i$) cannot decrease the parallel time, but only increase the sequential 
time. More generally, given any splitting $s$ of $T$ into subtrees, the best execution time for 
$s$ with ParSubtrees is achieved by choosing the $p$ largest subtrees for the parallel Step 2. 
This can be easily derived, as swapping a large tree included in the sequential part with a 
smaller tree included in the parallel part cannot increase the total execution time. 
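
The splitting procedure described above can be sketched in Python as follows. This is a minimal sketch under assumptions, not the authors' implementation: the function name `split_subtrees`, the dictionary-based tree encoding, the use of `heapq`, and the example tree are all choices made for this illustration. After each split, the candidate makespan is the weight of the heaviest subtree, plus the weight of the nodes split so far, plus the weight of every subtree beyond the $p$ heaviest ones.

```python
import heapq


def split_subtrees(children, w, root, p):
    """Return (makespan, parallel subtree roots, sequential nodes, surplus subtree roots)."""
    # W[i]: total weight of the subtree rooted at i (iterative post-order).
    W = {}
    stack = [(root, False)]
    while stack:
        node, visited = stack.pop()
        kids = children.get(node, [])
        if visited:
            W[node] = w[node] + sum(W[c] for c in kids)
        else:
            stack.append((node, True))
            stack.extend((c, False) for c in kids)

    def key(i):
        # heapq is a min-heap: negate to sort by non-increasing W, then non-increasing w.
        return (-W[i], -w[i], i)

    pq = [key(root)]
    seq_set, seq_weight = [], 0

    def current_splitting():
        roots = [k[2] for k in sorted(pq)]
        parallel, surplus = roots[:p], roots[p:]
        cost = W[parallel[0]] + seq_weight + sum(W[i] for i in surplus)
        return cost, parallel, list(seq_set), surplus

    best = current_splitting()        # splitting 0: the whole tree, processed as one subtree
    while W[pq[0][2]] > w[pq[0][2]]:  # stop once the heaviest subtree is a leaf
        node = heapq.heappop(pq)[2]
        seq_set.append(node)
        seq_weight += w[node]
        for c in children.get(node, []):
            heapq.heappush(pq, key(c))
        candidate = current_splitting()
        if candidate[0] < best[0]:
            best = candidate
    return best


# Hypothetical tree: root 0 with children 1 and 2; 1 has children 3 and 4; 2 has child 5.
children = {0: [1, 2], 1: [3, 4], 2: [5]}
w = {0: 1, 1: 1, 2: 1, 3: 4, 4: 4, 5: 2}
makespan, parallel, sequential, surplus = split_subtrees(children, w, root=0, p=2)
```

On this example with p = 2, the selected splitting runs the subtrees rooted at 3 and 4 in parallel and handles the subtree rooted at 2 together with nodes 0 and 1 sequentially. Re-sorting the queue after every split keeps the sketch short; an implementation concerned with running time would maintain the surplus weight incrementally.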

Algorithm 2: SplitSubtrees (T, p) 

1 Compute weights $W_i$, $\forall i \in T$ 
2 $PQ \leftarrow \{root\}$ 
3 $seqSet \leftarrow \emptyset$ 
4 $C^{ParSubtrees}_{max}(0) = W_{root}$ 
5 $s \leftarrow 1$ /* splitting rank */ 
6 while $W_{head(PQ)} > w_{head(PQ)}$ do 
7     $node \leftarrow popHead(PQ)$ 
8     $seqSet \leftarrow seqSet \cup \{node\}$ 
9     $PQ \leftarrow PQ \cup Children(node)$ 
10    $C^{ParSubtrees}_{max}(s) = W_{head(PQ)} + \sum_{i \in seqSet} w_i + \sum_{k=p+1}^{|PQ|} W_{PQ[k]}$ 
11    $s \leftarrow s + 1$ 
12 Select splitting $x$ with $C^{ParSubtrees}_{max}(x) = \min_{0 \le t \le s-1} C^{ParSubtrees}_{max}(t)$ 

Lemma 1. SplitSubtrees returns a splitting of T into subtrees that results in the 
makespan-optimal processing of T with ParSubtrees. 

Proof. The proof is by contradiction. Let $S$ be the splitting into subtrees selected by 
SplitSubtrees. Assume now that there is a different splitting $S_{opt}$ which results in a 
shorter processing with ParSubtrees. 

Let $r$ be the root node of a heaviest subtree in $S_{opt}$. Let $t$ be the first step in SplitSubtrees
where a node, say $r_t$, of weight $W_r$ is the head of $PQ$ at the end of the step ($r_t$ is not necessarily
equal to $r$, as there can be more than one subtree of weight $W_r$). There is always such a step $t$,
because all subtrees are split by SplitSubtrees until at least one of the largest trees is a leaf
node. By definition of $r$, there cannot be any leaf node heavier than $W_r$. The cost of the solution
of step $t$ is $C_{\max}^{\text{ParSubtrees}}(t) = W_r + Seq(t)$, hence parallel time plus sequential time, denoted by
$Seq(t)$. $Seq(t)$ is the total weight of the sequential set $seqSet$ plus the total weight of the surplus
subtrees (that is, of all the subtrees except the $p$ ones of largest weights). The cost of $S_{opt}$ is
$C^*_{\max} = W_r + Seq(S_{opt})$, given that $r$ is the root of a heaviest subtree of $S_{opt}$ by definition.

The splitting at step $t$ (and any other splitting considered by SplitSubtrees) cannot be
identical to $S_{opt}$, otherwise SplitSubtrees would have selected that splitting. All subtrees that
were split in SplitSubtrees before step $t$ were strictly heavier than $W_r$. Thus, there cannot exist
any subtree in $S_{opt}$ whose subtrees are part of the splitting at step $t$. Hence for every subtree $T_j$
in the splitting at step $t$ the following property holds: either $T_j$ is part of $S_{opt}$ or a splitting of $T_j$
into subtrees is part of $S_{opt}$. It directly follows that $Seq(t) \le Seq(S_{opt})$, because every splitting
of a tree into subtrees increases the sequential time by at least the root's weight. As the parallel
time is identical for $t$ and $S_{opt}$, namely $W_r$, it follows that $C_{\max}^{\text{ParSubtrees}}(t) \le C^*_{\max}$, which is a
contradiction to $S_{opt}$'s shorter processing time. □
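
The splitting procedure of Algorithm 2 can be illustrated with a minimal Python sketch. The tree representation (a `children` dictionary and a `w` dictionary keyed by integer node ids) and the function name `split_subtrees` are assumptions made for this illustration, and weights are assumed positive so that a node is a leaf exactly when $W_i = w_i$. For brevity the cost of each splitting is recomputed by sorting the queue, which is simpler but slower than the $O(p)$ update assumed in the complexity analysis.

```python
import heapq

def split_subtrees(children, w, root, p):
    """Return (cost, seq_set, subtree_roots) of the cheapest splitting found.

    children: dict node -> list of children (hypothetical representation)
    w:        dict node -> positive computation weight w_i
    """
    # Subtree weights W_i: visit nodes top-down, then accumulate bottom-up.
    order = [root]
    for node in order:                       # the list grows while we scan it
        order.extend(children.get(node, []))
    W = {}
    for node in reversed(order):
        W[node] = w[node] + sum(W[c] for c in children.get(node, []))

    def cost(pq, seq_weight):
        # Heap entries are (-W_i, -w_i, node): sorting puts the heaviest first.
        roots = sorted(pq)
        surplus = sum(-entry[0] for entry in roots[p:])   # beyond the p largest
        return -roots[0][0] + seq_weight + surplus

    pq = [(-W[root], -w[root], root)]
    seq_set, seq_weight = [], 0
    best = (W[root], [], [root])             # splitting 0: the whole tree
    while -pq[0][0] > -pq[0][1]:             # head of PQ is not a leaf
        _, _, node = heapq.heappop(pq)
        seq_set.append(node)
        seq_weight += w[node]
        for c in children.get(node, []):
            heapq.heappush(pq, (-W[c], -w[c], c))
        c_max = cost(pq, seq_weight)
        if c_max < best[0]:
            best = (c_max, list(seq_set), [entry[2] for entry in pq])
    return best

# Small example: node 1 has two heavy leaves, node 2 is a light leaf.
children = {0: [1, 2], 1: [3, 4]}
w = {0: 1, 1: 1, 2: 3, 3: 5, 4: 5}
print(split_subtrees(children, w, root=0, p=2))
```

In this example the best splitting puts nodes 0 and 1 in the sequential set, runs the two heaviest leaves in parallel, and leaves the light leaf for the sequential part.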

Complexity. We first analyse the complexity of SplitSubtrees. Computing the weights $W_i$
costs $O(n)$. Each insertion into $PQ$ costs $O(\log n)$ and calculating $C_{\max}^{\text{ParSubtrees}}(s)$ in each step
costs $O(p)$. Given that there are $O(n)$ steps, SplitSubtrees's complexity is $O(n(\log n + p))$. The
complexity of the sequential traversal algorithms used in Steps 2 and 3 of ParSubtrees is at
most $O(n^2)$, e.g., Liu's algorithm [14], or $O(n \log n)$ if the optimal postorder suffices. Thus the total
complexity of ParSubtrees is $O(n^2)$ or $O(n \log n)$, depending on the chosen sequential algorithm.
ParSubtrees has the following guarantees for the memory requirement and makespan.

Memory. ParSubtrees is a $(p+1)$-approximation algorithm for peak memory minimization.
During the parallel part of ParSubtrees the total memory used is at most $p$ times the memory
for the complete sequential execution ($M_{seq}$): $M_p \le p \cdot M_{seq}$. This is because each of the $p$
processors executes a maximal subtree, and the processing of any subtree uses no more
memory (if done optimally) than the processing of the whole tree. During the sequential part of
ParSubtrees the memory is bounded by $M_s \le M_{seq} + p \cdot \max_{i \in T} f_i \le (p+1) M_{seq}$, where the
second term accounts for the output files produced by the up to $p$ subtrees processed in parallel. Hence,
in total: $M \le (p+1) M_{seq}$.

Figure 3: ParSubtrees is at best a $p$-approximation for the makespan.



Makespan. ParSubtrees is a $p$-approximation algorithm for makespan minimization. In
other words, the makespan achieved by ParSubtrees can be up to $p$ times worse than the optimal
makespan, and thus may be no faster than the sequential execution. This can be derived readily
with a tree of height 1 and $p \cdot k$ leaves (a fork) and $w_i = 1$, $\forall i \in T$, where $k$ is a large integer (this
tree is depicted in Figure 3). The optimal makespan for such a tree is $C^*_{\max} = kp/p + 1 = k + 1$.
With ParSubtrees the makespan is $C_{\max} = 1 + (1 + pk - p) = p(k - 1) + 2$. When $k$ tends to
$+\infty$ the ratio between the makespans tends to $p$.
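
The two closed-form makespans for the fork can be evaluated numerically to see the ratio approach $p$. This is only an illustrative sketch; the function name `fork_makespans` is an assumption of this example.

```python
def fork_makespans(p, k):
    """Makespans on a fork with p*k unit-weight leaves and a unit-weight root."""
    optimal = (k * p) // p + 1            # k rounds of p leaves, then the root
    par_subtrees = 1 + (1 + p * k - p)    # p leaves in parallel, the rest sequentially
    return optimal, par_subtrees

for k in (1, 10, 100, 10_000):
    opt, ps = fork_makespans(p=4, k=k)
    print(k, opt, ps, ps / opt)
```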

Given the just observed worst case for the makespan, a makespan optimization for ParSubtrees
is to allocate all produced subtrees to the $p$ processors instead of only the $p$ largest ones. This can be done
by ordering the subtrees by non-increasing total weight and allocating each subtree in turn to
the processor with the lowest total weight. Each of the parallel processors executes its subtrees
sequentially. This optimized form of the algorithm shall be named ParSubtreesOptim. Note
that this optimization should improve the makespan, but it will likely worsen the peak memory
usage.
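
The allocation rule used by ParSubtreesOptim (heaviest subtree first, always to the least loaded processor) can be sketched as follows. The function name `allocate_subtrees` and the dictionary of subtree weights are assumptions made for this illustration, not the authors' implementation.

```python
import heapq

def allocate_subtrees(subtree_weights, p):
    """Greedy allocation: heaviest subtree first, to the least loaded processor.

    subtree_weights: dict subtree id -> total weight W of that subtree
    Returns (assignment, parallel_time).
    """
    loads = [(0, proc) for proc in range(p)]   # min-heap of (total weight, processor)
    heapq.heapify(loads)
    assignment = {proc: [] for proc in range(p)}
    order = sorted(subtree_weights, key=subtree_weights.get, reverse=True)
    for subtree in order:
        load, proc = heapq.heappop(loads)      # processor with the lowest total weight
        assignment[proc].append(subtree)
        heapq.heappush(loads, (load + subtree_weights[subtree], proc))
    return assignment, max(load for load, _ in loads)

assignment, parallel_time = allocate_subtrees(
    {"a": 5, "b": 4, "c": 3, "d": 3, "e": 1}, p=2)
print(assignment, parallel_time)
```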

5.2 Heuristic ParInnerFirst

ParSubtrees is a high-level algorithm employing sequential memory-optimized algorithms. An
alternative is to design algorithms that work directly on the tree in parallel, and we present two
such algorithms. From the sequential case it is known that a postorder traversal, while not optimal 
for all instances, provides good results f^. Our intention is to extend the principle of postorder 
traversal to the parallel processing. To do so we establish the following rules. 
Parallel Postorder: 

1. If an inner node (i.e., a non-leaf node) is ready to be processed (i.e., its input files are all in 
memory) then execute it. 

2. Otherwise, select and process the leaf node that is closest (in terms of edges to be traversed) 
to the previously selected leaf. 

These rules do not correspond to the usual formulation of postorder but, when applied using a 
single processor, they give rise to a postorder traversal of the tree. Due to the concurrent processing 
of nodes with p processors, the resulting order will not be a perfect postorder, but hopefully a 
close approximation. 

With the careful formulation of the parallel postorder we are able to base the heuristic on an
event-based list scheduling algorithm [8]. Algorithm 3 outlines a generic list scheduling, driven by
node finish time events. At each event at least one node has finished, so at least one processor is
available for processing nodes. Each available processor is given the respective head node of the
priority queue.

The order in which nodes are processed in Algorithm 3 is determined by two aspects: i) the
node order $O$ given as input; and ii) the ordering established by the priority queue $PQ$.

For our proposed parallel postorder algorithm, called ParInnerFirst, the priority queue uses
the following ordering: 1) inner nodes, ordered by non-increasing depth; 2) leaf nodes as ordered
in the input order $O$. To achieve a parallel postorder, the node ordering $O$ needs to be a sequential
postorder. It makes heuristic sense that this postorder is an optimal sequential postorder, so that
memory consumption can be minimized.

Algorithm 3: List scheduling($T$, $p$, $O$)

    Insert leaves in $PQ$, ordered as in $O$
    $eventSet \leftarrow \{0\}$    /* ascending order */
    while $eventSet \neq \emptyset$ do    /* event: node finishes */
        $popHead(eventSet)$
        Insert new ready nodes in $PQ$    /* available parents of nodes completed at event */
        $P_a \leftarrow$ available processors
        while $P_a \neq \emptyset$ and $PQ \neq \emptyset$ do
            $proc \leftarrow popHead(P_a)$; $node \leftarrow popHead(PQ)$
            Assign $node$ to $proc$
            $eventSet \leftarrow eventSet \cup finishTime(node)$
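
As an illustration of Algorithm 3 combined with the ParInnerFirst ordering, here is a minimal Python sketch. The tree is given as a `parent` dictionary over integer node ids; the names `list_schedule` and `par_inner_first_priority` are assumptions of this example, depth is counted in edges, and memory usage is not tracked. All completions that share the same finish time are handled together before new nodes are assigned.

```python
import heapq

def par_inner_first_priority(parent, order):
    """Key function: inner nodes first (deepest first), then leaves as in `order`."""
    inner = {par for par in parent.values() if par is not None}
    position = {v: i for i, v in enumerate(order)}
    depth = {}
    for v in parent:                          # depth = number of edges to the root
        path, u = [], v
        while u is not None and u not in depth:
            path.append(u)
            u = parent[u]
        d = depth[u] if u is not None else -1
        for x in reversed(path):
            d += 1
            depth[x] = d
    return lambda v: (0, -depth[v]) if v in inner else (1, position[v])

def list_schedule(parent, w, p, priority):
    """Event-driven list scheduling of a tree on p processors.

    Returns (makespan, start) where start maps each node to its start time.
    """
    pending = {v: 0 for v in parent}          # unfinished children per node
    for par in parent.values():
        if par is not None:
            pending[par] += 1
    ready = [(priority(v), v) for v in parent if pending[v] == 0]
    heapq.heapify(ready)
    events, start, free, time = [], {}, p, 0
    while ready or events:
        while free > 0 and ready:             # give work to available processors
            _, v = heapq.heappop(ready)
            start[v] = time
            free -= 1
            heapq.heappush(events, (time + w[v], v))
        time = events[0][0]                   # next finish-time event
        while events and events[0][0] == time:
            _, v = heapq.heappop(events)
            free += 1
            par = parent[v]
            if par is not None:
                pending[par] -= 1
                if pending[par] == 0:         # parent becomes ready
                    heapq.heappush(ready, (priority(par), par))
    return time, start

parent = {0: None, 1: 0, 2: 0, 3: 1, 4: 1, 5: 2, 6: 2}
w = {v: 1 for v in parent}
postorder = [3, 4, 1, 5, 6, 2, 0]
makespan, start = list_schedule(parent, w, 2, par_inner_first_priority(parent, postorder))
print(makespan, start)
```

Passing a different key function to `list_schedule` changes the heuristic without touching the event loop, which mirrors the generic role of $PQ$ in Algorithm 3.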




Figure 4: No memory bound for ParInnerFirst. 



Complexity. The complexity of ParInnerFirst is that of determining the input order $O$ and
that of the list scheduling. Computing the optimal sequential postorder is $O(n \log n)$ [13]. In the
list scheduling algorithm there are $O(n)$ events and $n$ nodes are inserted into and retrieved from $PQ$.
An insertion into $PQ$ is $O(\log n)$, so the list scheduling complexity is $O(n \log n)$. Hence, the total
complexity is also $O(n \log n)$.

In the following we study the memory requirement and makespan of ParInnerFirst. 

Memory. There is no limit on the required memory compared to the optimal sequential memory
$M_{seq}$. This is derived by considering the tree in Figure 4. All output files have size 1 and the
execution files have size 0 ($f_i = 1$, $n_i = 0$ for any node $i$ of $T$). When optimally processing with
$p = 1$, we process the leaves in a deepest first order. The resulting optimal memory requirement
is $M_{seq} = p + 1$, reached when processing a join node. With $p$ processors all leaves have been
processed at the time the first join node ($k - 1$) can be executed. (The longest chain has length
$2k$.) At that time there are $(k - 1) \cdot (p - 1) + 1$ files in memory. When $k$ tends to $+\infty$ the ratio
between the memory requirements also tends to $+\infty$.

Makespan. ParInnerFirst is a $(2 - \frac{1}{p})$-approximation algorithm for makespan
minimization, because it is a list scheduling algorithm [5].

5.3 Heuristic ParDeepestFirst 

The previous heuristic, ParInnerFirst, is motivated by the good memory results of sequential
postorder. Going in the opposite direction, a heuristic objective can be the minimization of the
makespan. For trees, all inner nodes depend on the leaf nodes, so it makes heuristic sense to try
to process the deepest nodes first to reduce any possible waiting time. For the parallel processing
of the tree, the most meaningful definition of the depth of a node $i$ is the $w$-weighted length of
the path from $i$ to the root of the tree. This path length includes $w_i$. The deepest node is the
first node of the critical path of the tree.

Figure 5: Tree with long chains.

ParDeepestFirst is our proposed algorithm that does this. Due to the general nature of
the list scheduling presented in Algorithm 3, we can implement ParDeepestFirst with it. To
achieve the deepest first processing, the priority queue $PQ$ orders the nodes as follows: 1) deepest
nodes first (in terms of $w$-weighted path length to root); 2) inner nodes before leaf nodes; 3) leaf
nodes are ordered in the input order $O$. Note that the leaf order is only relevant for leaves of the
same depth. This order should nevertheless be "reasonable", i.e., it should not alternate between
leaves from different parents, which would be bad for the memory consumption. Such an order is
again easily achieved when $O$ is a sequential postorder.
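
The three ordering criteria of ParDeepestFirst can be expressed as a sort key, as in the following sketch. The `parent` dictionary, the helper names `weighted_depth` and `par_deepest_first_key`, and the use of the position in $O$ to break remaining ties among inner nodes are assumptions of this illustration.

```python
def weighted_depth(parent, w):
    """w-weighted length of the path from each node to the root, including w_i."""
    depth = {}
    for v in parent:
        path, u = [], v
        while u is not None and u not in depth:
            path.append(u)
            u = parent[u]
        d = depth[u] if u is not None else 0
        for x in reversed(path):
            d += w[x]
            depth[x] = d
    return depth

def par_deepest_first_key(parent, w, order):
    """Key: deepest first, then inner nodes before leaves, then position in `order`."""
    depth = weighted_depth(parent, w)
    inner = {par for par in parent.values() if par is not None}
    position = {v: i for i, v in enumerate(order)}
    return lambda v: (-depth[v], 0 if v in inner else 1, position[v])

parent = {0: None, 1: 0, 2: 0, 3: 1, 4: 2}
w = {0: 1, 1: 2, 2: 5, 3: 1, 4: 1}
postorder = [3, 1, 4, 2, 0]
key = par_deepest_first_key(parent, w, postorder)
print(sorted(parent, key=key))
```

In an actual schedule only ready nodes compete in the queue; sorting all nodes here merely shows the relative priorities.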

Complexity. The complexity is the same as for ParInnerFirst, namely $O(n \log n)$. See
ParInnerFirst's complexity analysis.

Now we study the memory requirement and the makespan of ParDeepestFirst. 

Memory. The required memory of ParDeepestFirst is unbounded compared to the optimal
sequential memory $M_{seq}$. Consider the tree in Figure 5 with many long chains, assuming the
Pebble Game model (i.e., $f_i = 1$, $n_i = 0$, and $w_i = 1$ for any node $i$ of $T$). The optimal sequential
memory requirement is 3. The memory usage of ParDeepestFirst will be proportional to the
number of leaves, because they are all at the same depth, the deepest one. As we can build a
tree like the one of Figure 5 for any predefined number of chains, the ratio between the memory
required by ParDeepestFirst and the optimal one is unbounded.

Makespan. ParDeepestFirst is a $(2 - \frac{1}{p})$-approximation algorithm for makespan
minimization because it is, like ParInnerFirst, a list scheduling algorithm [5].

6 Experimental validation 

In this section, we experimentally compare the heuristics proposed in the previous section, and 
we compare their performance to lower bounds. 

6.1 Setup 

All heuristics have been implemented in C. Special care has been devoted to the implementation
to avoid complexity issues. In particular, priority queues have been implemented using binary heaps
to allow for $O(\log n)$ insertion and minimum extraction.¹

¹ The code and the data sets are available online at http://graal.ens-lyon.fr/~lmarchal/scheduling-trees/

Instead of implementing an intricate algorithm with $O(n^2)$ complexity such as Liu's algorithm [14]
to obtain the minimum sequential memory, we have chosen to estimate this minimum
memory using the optimal postorder traversal. We have shown in [S] that this traversal was
optimal in 95.8% of the tested cases, with an average increase of 1% with respect to the optimal.
This justifies this choice. Since the reference sequential task-graph traversal serves as a basis for
ordering nodes in a number of our heuristics, a large complexity would be prohibitive for this first
step.

6.2 Data set 

The data set contains assembly trees of a set of sparse matrices obtained from the University
of Florida Sparse Matrix Collection (http://www.cise.ufl.edu/research/sparse/matrices/).
The chosen matrices satisfy the following assertions: not binary, not corresponding to a graph,
square, having a symmetric pattern, a number of rows between 20,000 and 2,000,000, a number
of nonzeros per row at least equal to 2.5, and a number of nonzeros per row at most equal
to 5,000,000; and each chosen matrix has the largest number of nonzeros among the matrices
in its group satisfying the previous assertions. At the time of testing there were 76 matrices
satisfying these properties. We first order the matrices using MeTiS [10] (through the MeshPart
toolbox [4]) and amd (available in Matlab), and then build the corresponding elimination trees
using the symbfact routine of Matlab. We also perform a relaxed node amalgamation on these
elimination trees to create assembly trees. We have created a large set of instances by allowing
1, 2, 4, and 16 (if more than 1.6 x lO'^ nodes) relaxed amalgamations per node. At the end we 
compute memory weights and processing times to accurately simulate the matrix factorization:
we compute the memory weight $n_i$ of a node as $\eta^2 + 2\eta(\mu - 1)$, where $\eta$ is the number of nodes
amalgamated, and $\mu$ is the number of nonzeros in the column of the Cholesky factor of the matrix
which is associated with the highest node (in the starting elimination tree); the processing cost
$w_i$ of a node is defined as $\frac{2}{3}\eta^3 + \eta^2(\mu - 1) + \eta(\mu - 1)^2$ (these terms correspond to one Gaussian
elimination, two multiplications of a triangular $\eta \times \eta$ matrix with an $\eta \times (\mu - 1)$ matrix, and one
multiplication of a $(\mu - 1) \times \eta$ matrix with an $\eta \times (\mu - 1)$ matrix). The memory weights $f_i$ of edges
are computed as $(\mu - 1)^2$.
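
The three weight formulas above translate directly into code. The following sketch uses the hypothetical function name `assembly_tree_weights` and takes $\eta$ and $\mu$ as plain numbers; it is only meant to make the formulas concrete.

```python
def assembly_tree_weights(eta, mu):
    """Return (n_i, w_i, f_i) for a node with eta amalgamated nodes and mu nonzeros.

    n_i: memory weight of the node, w_i: processing cost, f_i: weight of its output edge.
    """
    n_i = eta**2 + 2 * eta * (mu - 1)
    w_i = 2 * eta**3 / 3 + eta**2 * (mu - 1) + eta * (mu - 1)**2
    f_i = (mu - 1)**2
    return n_i, w_i, f_i

print(assembly_tree_weights(eta=3, mu=5))
```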

The resulting 608 trees contain from 2,000 to 1,000,000 nodes. Their depth ranges from 12
to 70,000 and their maximum degree ranges from 2 to 175,000. Each heuristic is tested on each
tree using $p = 2$, 4, 8, 16, and 32 processors. Then the memory and makespan of the resulting
schedules are evaluated by simulating a parallel execution.

6.3 Results 



| Heuristic | Best memory | Within 5% of best memory | Avg. deviation from optimal (seq.) memory | Best makespan | Within 5% of best makespan | Avg. deviation from best makespan |
|---|---|---|---|---|---|---|
| ParSubtrees | 81.1 % | 85.2 % | 133.0 % | 0.2 % | 14.2 % | 34.7 % |
| ParSubtreesOptim | 49.9 % | 65.6 % | 144.8 % | 1.1 % | 19.1 % | 28.5 % |
| ParInnerFirst | 19.1 % | 26.2 % | 276.5 % | 37.2 % | 82.4 % | 2.6 % |
| ParDeepestFirst | 3.0 % | 9.6 % | 325.8 % | 95.7 % | 99.9 % | 0.0 % |

Table 1: Proportions of scenarios in which heuristics reach best (or close to best) performance, and
average deviations from optimal memory and best achieved makespan.



The comparison of the heuristics is summarized in Table 1. It shows that ParSubtrees and
ParSubtreesOptim are the best heuristics for memory minimization. On average they use less
than 2.5 times the amount of memory required by the best sequential postorder (whose memory
usage is very close to the optimal sequential memory, as noted above), whereas ParInnerFirst and
ParDeepestFirst need respectively 3.7 and 5.2 times this amount of memory. ParInnerFirst
and ParDeepestFirst perform best for makespan minimization, having makespans very close
on average to the best achieved ones. As the scheduling problem, without memory constraints,
is already NP-hard, we do not know what the optimal makespan is. We have seen however that
ParInnerFirst and ParDeepestFirst are 2-approximation algorithms for the makespan.
Furthermore, given the critical path oriented node ordering, we can expect that ParDeepestFirst's
makespan is close to optimal. ParDeepestFirst outperforms ParInnerFirst for makespan
minimization, at the cost of a noticeable increase in memory. ParSubtrees and ParSubtreesOptim
may be better trade-offs, since their average deviation from best makespan is under 35%.

Figure 6: Comparison to lower bounds.

Figure 7: Comparison to ParSubtrees.

Figures 6, 7 and 8 provide complete results of the simulations. In each figure, a point represents
one scenario (one heuristic on one tree with a given number of processors). To better visualize
the distribution, we also plot a "cross" for each heuristic: the center of this cross is the average
performance, while the branches represent the scope of each objective between the 10th and the
90th percentile of the distribution.

In Figure 6 we plot the results of all simulations compared to some estimations of the lower
bounds. The lower bound for memory minimization is the memory usage of the best sequential
postorder, which is known to be very close to the optimal sequential traversal. The lower bound for
the makespan is the maximum between the total processing time of the tree divided by the number
of processors, and the maximum weighted critical path. This figure exhibits the same trends for
average values as noted in Table 1. While the maximum deviation from the lower bound on the
makespan is around 4, the ratio of the parallel memory usage to the optimal sequential one can
be far larger, exceeding 100 in the extreme cases.

Figure 8: Comparison to ParInnerFirst.
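
The makespan lower bound used in this comparison (the larger of the average load per processor and the weighted critical path) can be computed as in the sketch below. The `parent` dictionary representation and the function name `makespan_lower_bound` are assumptions of this illustration.

```python
def makespan_lower_bound(parent, w, p):
    """max(total work / p, heaviest w-weighted path from a node to the root)."""
    depth = {}
    for v in parent:
        path, u = [], v
        while u is not None and u not in depth:
            path.append(u)
            u = parent[u]
        d = depth[u] if u is not None else 0
        for x in reversed(path):
            d += w[x]                      # path length includes the node's own weight
            depth[x] = d
    critical_path = max(depth.values())
    return max(sum(w.values()) / p, critical_path)

parent = {0: None, 1: 0, 2: 0, 3: 1, 4: 2}
w = {0: 1, 1: 2, 2: 5, 3: 1, 4: 1}
for p in (1, 2, 4):
    print(p, makespan_lower_bound(parent, w, p))
```

With few processors the average load dominates, while with many processors the critical path becomes the binding term.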

In the following figures, the results of the heuristics are normalized by the results of
ParSubtrees (Figure 7) or ParInnerFirst (Figure 8). As expected, ParSubtreesOptim gives
results close to those of ParSubtrees, with better makespans but slightly worse memory usage.
ParDeepestFirst always uses more memory than ParInnerFirst, while having comparable
makespans. In most cases, ParInnerFirst gives a slightly better makespan than ParSubtrees,
but uses more memory.

7 Conclusion 

In this study we have shown that the parallel version of the pebble game on trees is NP-complete,
hence stressing the negative impact of the memory constraints on the complexity of the problem.
More importantly, we have shown that there does not exist any algorithm that is simultaneously an
approximation algorithm for both makespan minimization and peak memory usage minimization
when scheduling tree-shaped task graphs. We have thus designed heuristics for this problem. We
have assessed their performance using real task graphs arising from sparse matrix computations.
These simulations showed that two of the heuristics, ParSubtrees and ParSubtreesOptim,
needed on average only 2.5 times the sequential memory for their parallel executions, while
achieving makespans that were less than 35% larger than the best achieved ones. These heuristics
thus appear to deliver interesting trade-offs between memory usage and execution time. In
future work, we will consider designing scheduling algorithms that take as input a cap on the
memory usage.